---
title: EDA
author: Ling Huang
date: '2021-03-29'
slug: []
categories: []
tags: []
---

1. Overview

This post serves as a sub-module of the Visual Analytics project. I aim to use time series analysis techniques and interactive approaches in R to present possible visualizations of US market stocks.

Throughout this exercise, I mainly use the “tidyverse”, “tidyquant”, “timetk”, “TSstudio” and “forecast” packages in R to explore the patterns of the stock prices and the transaction volumes. It consists of Single Time Series Analysis, Multiple Time Series Analysis and Auto-correlation Plots. The entire project will incorporate Machine Learning and ARIMA Model Forecasting, so this sub-module is both the preliminary data exploration, used to interpret the data’s behaviors and patterns, and the pre-processing step for further analysis.

2. Literature Review

According to APTECH, time series data is data that is collected at different points in time.

A few examples are shown below. (Reference: APTECH)

Generally speaking, the Time Series data has six basic patterns:

Thus, visualizing time series data provides a preliminary tool for detecting if data:

In this project, we target US market stocks, as stock prices and transaction volumes are a form of time series data. In addition, we will try to forecast the stock prices (in another sub-module).

In this sub-module, I will start off with visualizing some stocks’ prices and transaction volumes as Exploratory Data Analysis.

Then, I will pick one stock, assess it with the autocorrelation function (ACF) and the partial autocorrelation function (PACF), and stationize it by differencing. Why do we need to stationize the data? In general, stock prices and volumes are not stationary, so making the data stationary is a necessary processing step before forecasting.

3. Exploring and Visualizing the stock data

Check the required R packages and load them

packages = c('timetk', 'modeltime', 'tidymodels', 'lubridate', 'tidyverse', 'tidyquant', 'TSstudio', 'forecast')

# Install any package that is missing, then attach it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

Load the data

We selected 8 stocks from 4 different sectors.

In this post, we focus on the period from April 1, 2015 to March 31, 2021.

The tq_get() function is used to retrieve the stock prices and transaction volumes.

stocks = c('AAL', 'SAVE', 'BAC', 'JPM', 'JNJ', 'PFE', 'MSFT', 'AAPL')

startdate <- "2015-04-01"
enddate <- "2021-03-31"

data <- data.frame()

for(s in stocks){
  newstock <- tq_get(s, get = "stock.prices", from = startdate, to  = enddate)
  data <- rbind(data, newstock)

}
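Growing a data frame with rbind() inside a loop is fine for 8 symbols, but every iteration copies the accumulated rows. The sketch below shows the collect-then-bind alternative. Note that `fetch_prices()` is a hypothetical stand-in that returns placeholder rows so the example runs offline; it is not part of tidyquant, and the real code would call tq_get() in its place.

```r
# Hypothetical offline stand-in for tq_get(): returns placeholder rows, not real prices.
fetch_prices <- function(symbol) {
  data.frame(
    symbol   = symbol,
    date     = as.Date("2021-03-29") + 0:2,
    adjusted = c(1, 2, 3),   # placeholder values only
    stringsAsFactors = FALSE
  )
}

symbols <- c("AAL", "SAVE", "BAC")

# Fetch each symbol into a list, then bind once instead of calling rbind() per iteration.
pieces   <- lapply(symbols, fetch_prices)
combined <- do.call(rbind, pieces)

nrow(combined)            # 3 symbols x 3 rows each
unique(combined$symbol)
```

According to the tidyquant documentation, tq_get() also accepts a character vector of symbols and returns a single stacked table with a symbol column, which would remove the need for a loop altogether.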

Scatterplot of 8 stocks’ prices

Let’s display them in 2 columns.

data %>%
  group_by(symbol) %>%
  plot_time_series(date, adjusted,
                   .color_var = year(date),  
                   .facet_ncol = 2,
                   .interactive = T,
                   .y_intercept = 0,
                   .title = "Stocks Price",
                   # .x_lab = "Date",
                   # .y_lab = "Price (US$)",
                   .color_lab = "Year",
                   .plotly_slider = FALSE) 

Scatterplot of 8 stocks’ transaction volumes

Let’s display them in 2 columns.

data %>%
  group_by(symbol) %>%
  summarise_by_time(
    date, .by = "month",
    volume = SUM(volume)
  ) %>%
  plot_time_series(date, volume, 
                   .facet_vars   = contains("symbol"),
                   .title = "Transaction Volume",
                   .facet_ncol = 2, .interactive = T, .y_intercept = 0)
## plot_time_series(...): Groups are previously detected. Grouping by: symbol

Scatterplot of the stock price - Weekly Trend

We can aggregate the data on a weekly basis. Here, I choose American Airlines as an example.

data %>%
  filter(symbol == "AAL") %>%
  summarise_by_time(
    date, .by = "week",
    meanadjusted = mean(adjusted)
    ) %>%
  plot_time_series(date, meanadjusted, .interactive = T, .y_intercept = 0)
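To make the aggregation step concrete, here is a small base-R sketch of what a weekly mean does: each date is mapped to the start of its week, and the prices are averaged within each week. The dates and prices are made up for illustration. Note also that cut() starts weeks on Monday by default, which may not match the week boundary that summarise_by_time() uses.

```r
# Two full weeks of made-up daily values; 2021-03-01 is a Monday.
dates  <- seq(as.Date("2021-03-01"), by = "day", length.out = 14)
prices <- 1:14

# Label every date with the first day of its week (Monday-based by default).
week_start <- cut(dates, breaks = "week")

# Average the values within each week.
weekly_mean <- tapply(prices, week_start, mean)
weekly_mean   # two weeks, with means 4 and 11
```

Replacing `breaks = "week"` with `breaks = "month"` gives the monthly version of the same idea.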

Scatterplot of the stock price - Monthly Trend

We can aggregate the data on a monthly basis. Here, I choose Spirit Airlines as an example.

data %>%
  filter(symbol == "SAVE") %>%
  summarise_by_time(
    date, .by = "month",
    meanadjusted = mean(adjusted)
    ) %>%
  plot_time_series(date, meanadjusted, .interactive = T, .y_intercept = 0)

Interactive scatterplot of the stock price

Here, I choose Johnson & Johnson as an example.

JNJ <- data %>%
  filter(symbol == "JNJ") %>%
  select("date", "adjusted")
  
ts_plot(JNJ,
        title = "Johnson and Johnson",
        Xtitle = "Date",
        Ytitle = "Price",
        color = "blue",
        slider = TRUE,
        Xgrid = TRUE,
        Ygrid = TRUE)

Interactive scatterplot of the stock volume

Here, I illustrate with Apple Inc. stock.

apple <- data %>%
  filter(symbol == "AAPL") %>%
  select("date", "volume")
  
ts_plot(apple,
        title = "Apple Inc.",
        Xtitle = "Date",
        Ytitle = "Volume",
        color = "pink",
        slider = TRUE,
        Xgrid = TRUE,
        Ygrid = TRUE)

4. Stationize the data

Let’s pick one stock, JP Morgan from the Financials sector, and illustrate how we can make the data stationary.

Visualize the JP Morgan price

jpmorgan <- data %>%
  filter(symbol == "JPM")

jpmorgan %>%
plot_time_series(date, adjusted, .color_var = year(date), .interactive = T)

Assess its ACF and PACF

In R this is done with the appropriately named acf and pacf functions.

The ACF shows the correlation of a time series with lags of itself. That is, how much the time series is correlated with itself at one lag, at two lags, at three lags and so on.

acf(jpmorgan$adjusted)
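The definition above can be checked by hand. The sketch below computes the lag-k autocorrelation directly from its formula and compares it with acf(). It uses a simulated random walk rather than the JP Morgan prices so that it runs without downloading data; `manual_acf()` is a helper written for this illustration only.

```r
set.seed(42)
x <- cumsum(rnorm(200))   # simulated random walk, not real price data

# Lag-k autocorrelation: covariance of the series with its k-step lag,
# divided by the variance of the series (both around the overall mean).
manual_acf <- function(x, k) {
  n  <- length(x)
  xc <- x - mean(x)
  sum(xc[(k + 1):n] * xc[1:(n - k)]) / sum(xc^2)
}

r_manual  <- sapply(1:3, function(k) manual_acf(x, k))
r_builtin <- acf(x, lag.max = 3, plot = FALSE)$acf[2:4]   # element 1 is lag 0

all.equal(r_manual, r_builtin)
```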

The PACF is a little more complicated. The autocorrelation at lag one can have lingering effects on the autocorrelation at lag two and onward. The partial autocorrelation is the amount of correlation between a time series and lags of itself that is not explained by a previous lag. So, the partial autocorrelation at lag two is the correlation between the time series and its second lag that is not explained by the first lag.

pacf(jpmorgan$adjusted)
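For lag two, "not explained by the first lag" has a simple closed form: with r1 and r2 the autocorrelations at lags one and two, the partial autocorrelation is (r2 - r1^2) / (1 - r1^2). The sketch below checks this against pacf() on a simulated AR(1) series; the simulated data and the 0.7 coefficient are assumptions chosen only for illustration.

```r
set.seed(1)
x <- as.numeric(arima.sim(model = list(ar = 0.7), n = 300))   # simulated series

# Sample autocorrelations at lags 1 and 2 (element 1 of $acf is lag 0).
r <- acf(x, lag.max = 2, plot = FALSE)$acf[2:3]

# Lag-2 partial autocorrelation: the part of r[2] not already implied by lag 1.
phi22 <- (r[2] - r[1]^2) / (1 - r[1]^2)

# pacf() starts at lag 1, so p[1] is lag 1 and p[2] is lag 2.
p <- pacf(x, lag.max = 2, plot = FALSE)$acf

all.equal(p[1], r[1])    # at lag 1 the PACF equals the ACF
all.equal(p[2], phi22)   # at lag 2 it matches the formula
```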

Use the Differencing technique

Differencing is a time series method that subtracts each data point in the series from its successor. It is commonly used to make a time series stationary. If the time series appears to be seasonal, a better approach is to difference each point against the corresponding point in the previous season, which removes the seasonal effect.
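A tiny example with made-up numbers shows what diff() computes in each case. The `lag` argument gives the season-style difference, and `differences` repeats the regular difference.

```r
x <- c(10, 12, 15, 14, 18)   # made-up values

d1 <- diff(x)                    # each point minus its predecessor:  2  3 -1  4
d2 <- diff(x, differences = 2)   # difference of the differences:     1 -4  5
ds <- diff(x, lag = 2)           # each point minus the point 2 back: 5  2  3
```

Each round of differencing shortens the series: one element per regular difference, and `lag` elements for a lagged difference.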

But how do we know how many rounds of differencing are needed? The nsdiffs() and ndiffs() functions from the forecast package estimate the number of seasonal and regular differences, respectively, needed to make the series stationary. (Note: for most time series, 1 or 2 rounds of differencing are enough to make the series stationary.)

Seasonal Differencing

# nsdiffs(jpmprice$adjusted)  # number for seasonal differencing needed
#> Error in nsdiffs(jpmprice$adjusted) : Non seasonal data

Regular Differencing

ndiffs(jpmorgan$adjusted)  # number of differences needed
## [1] 1

Make it stationary

stationaryTS <- diff(jpmorgan$adjusted, differences= 1)
plot(stationaryTS, type="l", main="Differenced and Stationary")  # appears to be stationary
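Besides eyeballing the plot, one rough check is to compare the lag-1 autocorrelation before and after differencing. A non-stationary level series typically has autocorrelation close to 1, while the differenced series should be close to 0. The sketch below uses a simulated random walk as a stand-in for a price series. It is an informal check under that assumption, not a formal stationarity test.

```r
set.seed(123)
walk  <- cumsum(rnorm(500))          # simulated random walk, not real price data
steps <- diff(walk, differences = 1)  # first difference, as in the code above

# Lag-1 autocorrelation of the level series and of the differenced series
# (element 1 of $acf is lag 0, so lag 1 is element 2).
lag1_level <- acf(walk,  lag.max = 1, plot = FALSE)$acf[2]
lag1_diff  <- acf(steps, lag.max = 1, plot = FALSE)$acf[2]

c(level = lag1_level, differenced = lag1_diff)
```

Running acf() on the differenced JP Morgan series in the same way would show whether any autocorrelation remains after one round of differencing.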

5. Observations & Suggestions

The airlines’ stock prices dropped dramatically in 2020 due to Covid-19, whereas the stock prices in the Healthcare and Information Technology sectors skyrocketed because Covid-19 boosted the usage of healthcare and IT products.

From the 1st observation, I suggest that we offer a few different stock sectors for users to select when we build the final Shiny App.

In other words, stocks in the same sector show similar trends and fluctuations. For instance, if we compare Apple Inc.’s price scatterplot with Microsoft Corporation’s over the period studied, their performance is almost the same. The price scatterplots of American Airlines and Spirit Airlines look almost the same too. Stocks in the same sector are probably affected by the sector’s outlook and by investors’ sentiment toward it. If investors favor a sector, most stocks in that sector benefit.

From the 2nd observation, I suggest that we may also need to allow users to select several stocks from the same sector in the final Shiny App.

This could be explained by the Covid-19 factor too. Investors rushed to sell off the stocks negatively affected by Covid-19 and to buy heavily into the stocks positively affected by it.

From the 3rd observation, we should display the stock’s transaction volume data as well as its price data in the final Shiny App, as both offer meaningful insights.

From the 4th observation, we can present the ACF and PACF charts and the differenced (stationary) series along with brief explanations.

On top of the ideas above, I think the start date and end date, as key components of the EDA, should be calendar-view selections so that users can choose any time period they wish.

6. Storyboard for the design of the sub-module.

After examining and exploring the stock data, I propose to EDA the data in the following ways. 1. xx 2. xx 3. xxx

The storyboard for the design is attached below.

7. References